hw: fix WMMA fp16/bf16 RTL output handling #359

Open
cassuto wants to merge 4 commits into vortexgpgpu:master from cassuto:fix_wmma_bf16_fp16

Conversation

cassuto (Contributor) commented Jun 3, 2026

Issue

WMMA fp16->fp16 and bf16->bf16 operations did not honor fmt_d: the RTL always treated the accumulator input/output path as FP32.

Root Cause

VX_tcu_fedp_bhf was missing destination-format handling. When fmt_d selected fp16 or bf16, the RTL still produced FP32-formatted bits instead of the expected 16-bit fp16/bf16 result in the low halfword.
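To make the mismatch concrete, here is a minimal Python sketch (not the RTL) of what the same value looks like as FP32 bits versus a 16-bit fp16 or bf16 result placed in the low halfword. The helper names, the string values used for `fmt_d`, the round-to-nearest-even mode, and the zeroed upper halfword are all assumptions made for illustration; the PR description does not specify them.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 encoding of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp16_bits(x: float) -> int:
    """Return the IEEE 754 binary16 encoding of x.

    struct's 'e' format rounds to nearest-even and raises OverflowError
    for values outside the finite fp16 range.
    """
    return struct.unpack("<H", struct.pack("<e", x))[0]


def bf16_bits(x: float) -> int:
    """Return a bfloat16 encoding of x: the upper 16 bits of binary32.

    Assumes round-to-nearest-even; NaN and overflow-on-rounding are not
    handled in this sketch.
    """
    b = f32_bits(x)
    rounded = b + 0x7FFF + ((b >> 16) & 1)
    return (rounded >> 16) & 0xFFFF


def pack_result(x: float, fmt_d: str) -> int:
    """Hypothetical model of a 32-bit result word selected by fmt_d.

    16-bit formats land in the low halfword; the upper halfword is left
    as zero here, which is an assumption of this sketch.
    """
    if fmt_d == "fp16":
        return fp16_bits(x)
    if fmt_d == "bf16":
        return bf16_bits(x)
    return f32_bits(x)


for fmt in ("fp32", "fp16", "bf16"):
    print(fmt, hex(pack_result(1.5, fmt)))
```

For the value 1.5, the FP32 word is `0x3fc00000`, whose low halfword is `0x0000`, while the fp16 and bf16 encodings are `0x3e00` and `0x3fc0`. A consumer reading the low halfword of an FP32-formatted word therefore sees an unrelated value.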

Proposal

Add fmt_d handling in the BHF TCU datapath so fp16 and bf16 WMMA outputs are rounded and packed in the expected destination format. Extend the SGEMM TCU regression coverage to include fp16->fp16 and bf16->bf16 cases, with ULP checks in the native 16-bit encoding space.
This fix passes synthesis testing.
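As an illustration of what a ULP check in the native 16-bit encoding space can look like, the sketch below compares two raw 16-bit encodings by mapping each sign-magnitude bit pattern to a monotonic integer and taking the difference. The function names and the tolerance parameter are hypothetical, and this is not the regression's actual comparison code, which is not shown in this description. NaN encodings are not treated specially.

```python
def ordered16(bits: int) -> int:
    """Map a 16-bit sign-magnitude float encoding to a monotonic integer.

    Works for both fp16 and bf16, since each has a sign bit followed by
    a magnitude that increases with the encoded value. +0 and -0 both
    map to 0.
    """
    magnitude = bits & 0x7FFF
    return -magnitude if bits & 0x8000 else magnitude


def ulp_distance16(a_bits: int, b_bits: int) -> int:
    """Number of representable 16-bit values separating two encodings."""
    return abs(ordered16(a_bits) - ordered16(b_bits))


def within_ulp16(actual_bits: int, expected_bits: int, max_ulp: int) -> bool:
    """True if the two encodings differ by at most max_ulp units."""
    return ulp_distance16(actual_bits, expected_bits) <= max_ulp
```

Comparing in the 16-bit encoding space means one ULP corresponds to one step of the destination format, rather than one step of FP32, so the tolerance keeps the same meaning for fp16 and bf16 outputs.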

cassuto force-pushed the fix_wmma_bf16_fp16 branch from ff0fbc4 to ce6d361 on June 3, 2026 08:14